feat(git): serve fresh snapshots via delta bundles #197
// delta bundle covering commits between the snapshot's HEAD and the mirror's
// current HEAD. When no bundle is needed (common case), the snapshot is
// streamed directly without buffering.
func (s *Strategy) serveSnapshotWithBundle(ctx context.Context, w http.ResponseWriter, reader io.ReadCloser, headers http.Header, repo *gitclone.Repository, upstreamURL string) error {
Another approach could be to return a header like X-Cachew-Bundle-URL that tells the client where to retrieve the bundle. Only reason I say that is that it's a bit weird to request snapshot.tar.zst and get a weird combined response.
Yeah, that is fair. I like that approach.
With the new API, I'm refactoring a bit to also store bundles in S3 with an aggressive TTL. Without that, I'm seeing occasional errors when the client tries to retrieve the bundle and hits a different cachew node that doesn't have it. It's not the worst, but when that happens no bundle gets applied and the snapshot can be up to an hour old.
}

func (s *Strategy) createBundle(ctx context.Context, repo *gitclone.Repository, baseCommit string) ([]byte, error) {
	return gitclone.WithReadLockReturn(repo, func() ([]byte, error) { //nolint:wrapcheck // error is already wrapped inside the closure
Did we figure out if we actually need read locks when doing operations like this?
No, we do not; git handles it. Removed.
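For context, a minimal sketch of what a delta bundle invocation could look like without any application-level lock. The revision range `baseCommit..HEAD`, the function name `bundleCommand`, and the paths are assumptions for illustration; the PR's actual `createBundle` body is not shown in this thread.

```go
package main

import "os/exec"

// bundleCommand builds (but does not run) a git invocation that would write
// a bundle containing the commits reachable from HEAD but not from
// baseCommit. The range syntax is an assumption about how the delta is
// selected; git bundle create accepts rev-list style arguments.
func bundleCommand(repoDir, baseCommit, outPath string) *exec.Cmd {
	cmd := exec.Command("git", "bundle", "create", outPath, baseCommit+"..HEAD")
	cmd.Dir = repoDir // run inside the local mirror
	return cmd
}
```

A caller would run the returned command with `CombinedOutput` and include the output in the error on failure.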
Workstations now receive a snapshot at HEAD instead of a potentially stale cached snapshot. On each snapshot request, cachew:

1. Fetches the mirror synchronously (O(delta), cheap)
2. Serves the cached snapshot tar.zst (bulk of the repo, fast)
3. Appends a git bundle of commits between the snapshot's HEAD and the mirror's current HEAD (O(delta), cheap)

The response uses a header-framed format:

- Content-Type: application/x-cachew-snapshot
- X-Cachew-Snapshot-Size: byte length of the snapshot portion
- Body: [snapshot.tar.zst][delta.bundle]

The periodic snapshot job still regenerates the full snapshot on interval, bounding the bundle size. This avoids the expensive O(repo-size) tar+zstd regeneration on the hot path while ensuring workstations always start near HEAD.

Co-authored-by: Amp <amp@ampcode.com>
Amp-Thread-ID: https://ampcode.com/threads/T-019d02ea-462f-733a-bef4-03ee9e2d23c8
Instead of concatenating the bundle into the snapshot response body, return an X-Cachew-Bundle-URL header pointing to a new /snapshot.bundle endpoint. This gives cleaner HTTP semantics and lets clients fetch the bundle in parallel with snapshot extraction.

Co-authored-by: Amp <amp@ampcode.com>
Amp-Thread-ID: https://ampcode.com/threads/T-019d0383-345f-7659-8538-30bc7c9a7e6d
When a bundle is generated (either proactively during snapshot serving or on-demand at the bundle endpoint), cache it in S3 with a 2h TTL. The bundle endpoint checks cache first, so any pod can serve bundles without needing the local mirror. Eliminates 404s when the bundle request is load-balanced to a different pod.

Co-authored-by: Amp <amp@ampcode.com>
Amp-Thread-ID: https://ampcode.com/threads/T-019d0383-345f-7659-8538-30bc7c9a7e6d
repoPath, err := gitclone.RepoPathFromURL(upstreamURL)
if err == nil {
	bundleURL := fmt.Sprintf("/git/%s/snapshot.bundle?base=%s", repoPath, snapshotCommit)
	w.Header().Set("X-Cachew-Bundle-Url", bundleURL)
if _, err = io.Copy(w, reader); err != nil {
	logger.ErrorContext(ctx, "Failed to stream snapshot", "upstream", upstreamURL, "error", err)

bundleData, err := s.createBundle(ctx, repo, base)
I don't quite understand why this wouldn't just return an io.Reader rather than buffering?
	return nil, errors.Wrapf(err, "git bundle create: %s", string(output))
}

data, err := os.ReadFile(bundlePath) //nolint:gosec // bundlePath is a temp file we created
There's no need to buffer this. The file can be deleted while open, and the reader will continue.
Summary
Serves fresh git snapshots by supplementing cached S3 snapshots with small delta bundles, avoiding expensive full snapshot regeneration on every request.
Problem
For busy repos, the cached snapshot becomes stale quickly. Previously, detecting staleness meant regenerating the entire snapshot from scratch — effectively invalidating the cache on every request. The regenerated snapshot would itself be stale by the next request.
Solution
When the cached snapshot HEAD differs from the local mirror HEAD, cachew:
1. Serves the cached snapshot (application/zstd)
2. Returns an X-Cachew-Bundle-Url response header pointing to a separate /snapshot.bundle endpoint

Delta bundles are cached in S3 (2h TTL) so any cachew pod can serve them, eliminating cross-pod 404s.
Key details
- The snapshot is still served as application/zstd. Bundle is served at a separate URL as application/x-git-bundle. Fully backward compatible: old clients ignore the header.
- No read locks around git bundle create or git rev-parse: git handles its own file-level locking, consistent with serveFromBackend.
Deploy order
Cachew first (backward compatible), then blox.